Resampling methods are an indispensable tool in modern statistics
We will cover both cross-validation and the bootstrap
Cross-validation can be used to estimate the test error associated with a given statistical learning method in order to evaluate its performance, or to select the appropriate level of flexibility
The bootstrap is used in several contexts, most commonly to provide a measure of accuracy of a parameter estimate or of a given statistical learning method
Model assessment: the process of evaluating a model's performance
Model selection: the process of selecting the proper level of flexibility for a model
We are going to discuss two resampling methods:
Cross-validation
Bootstrap
These methods refit a model of interest to samples formed from the training set, in order to obtain additional information about the fitted model
For example, they provide estimates of test-set prediction error, and the standard deviation and bias of our parameter estimates
Recall the distinction between the test error and the training error:
Test error: the average error that results from using a statistical learning method to predict the response on a new observation, one that was not used in training the method
Training error: can be easily calculated by applying the statistical learning method to the observations used in its training
But the training error rate often is quite different from the test error rate, and in particular the former can dramatically underestimate the latter
Training versus test performance
Here we randomly divide the available set of samples into two parts: a training set and a validation or hold-out set
The model is fit on the training set, and the fitted model is used to predict the responses for the observations in the validation set
The resulting validation-set error provides an estimate of the test error. This is typically assessed using MSE in the case of a quantitative response and misclassification rate in the case of a qualitative (discrete) response
Example
Automobile data
Want to compare linear vs higher-order polynomial terms in a linear regression
We randomly split the 392 observations into two sets, a training set containing 196 of the data points, and a validation set containing the remaining 196 observations
Left panel: validation-set MSE for a single random split of the data
Right panel: validation-set MSE for multiple random splits, illustrating the variability of the estimate
Drawbacks of validation set approach?
The validation estimate of the test error can be highly variable, depending on precisely which observations are included in the training set and which are included in the validation set
In the validation approach, only a subset of the observations — those that are included in the training set rather than in the validation set — are used to fit the model
This suggests that the validation set error may tend to overestimate the test error for the model fit on the entire data set
An example of the validation set approach
library(tidyverse)
library(ISLR2)
library(rsample)
# set seed for reproducibility
set.seed(1234)

# load the Auto data from ISLR2
Auto <- ISLR2::Auto

# randomly split the observations 50/50 into training and validation sets
split_auto <- initial_split(Auto, prop = 0.5)
train_auto <- training(split_auto)
test_auto <- testing(split_auto)

# fit a linear regression of mpg on horsepower using only the training set
lm.fit <- lm(mpg ~ horsepower, data = train_auto)
summary(lm.fit)

# validation-set MSE: predict on the held-out half and average the squared errors
pred.fit <- predict(lm.fit, newdata = test_auto)
mean((test_auto$mpg - pred.fit)^2)
#Therefore, the estimated test MSE for the linear regression fit is 23.94
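To compare the linear fit with higher-order polynomial terms, as set up in the example above, the same split can be reused. A minimal sketch, assuming the train_auto and test_auto objects created above (the range of degrees is an illustrative choice):

# validation-set MSE for polynomial fits of degree 1 through 10, reusing the split above
val_mse <- sapply(1:10, function(d) {
  fit <- lm(mpg ~ poly(horsepower, d), data = train_auto)
  mean((test_auto$mpg - predict(fit, newdata = test_auto))^2)
})
val_mse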
Widely used approach for estimating test error
Estimates can be used to select best model, and to give an idea of the test error of the final chosen model
Idea is to randomly divide the data into \(K\) equal-sized parts. We leave out part \(k\), fit the model to the other \(K − 1\) parts (combined), and then obtain predictions for the left-out \(kth\) part
This is done in turn for each part \(k = 1, 2, \ldots, K\), and then the results are combined
Example
Let the \(K\) parts be \(C_1, C_2, \ldots, C_K\), where \(C_k\) denotes the indices of the observations in part \(k\)
There are \(n_k\) observations in part \(k\): if \(n\) is a multiple of \(K\), then \(n_k = n/K\)
Compute \(CV_{(K)} = \sum_{k=1}^{K} \frac{n_k}{n} MSE_k\), where \(MSE_k = \sum_{i \in C_k} (y_i - \hat{y}_i)^2 / n_k\) and \(\hat{y}_i\) is the fit for observation \(i\), obtained from the data with part \(k\) removed
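As a concrete illustration of this formula, here is a hand-rolled sketch on the Auto data (K = 10, the random fold assignment, and the mpg ~ horsepower model are illustrative choices):

# hand-rolled K-fold CV: compute MSE_k on each held-out part, then take the
# weighted average over k of (n_k / n) * MSE_k
set.seed(1234)
Auto <- ISLR2::Auto
K <- 10
folds <- sample(rep(1:K, length.out = nrow(Auto)))   # random fold assignment

mse_k <- sapply(1:K, function(k) {
  fit <- lm(mpg ~ horsepower, data = Auto[folds != k, ])   # fit on the other K - 1 parts
  pred <- predict(fit, newdata = Auto[folds == k, ])       # predict the left-out part
  mean((Auto$mpg[folds == k] - pred)^2)                    # MSE_k
})
n_k <- as.numeric(table(folds))
sum(n_k / nrow(Auto) * mse_k)   # CV_(K)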
With least-squares linear or polynomial regression, an amazing shortcut makes the cost of LOOCV the same as that of a single model fit! The following formula holds
\(CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \left(\frac{y_i-\hat{y}_i}{1-h_i}\right)^2\)
where \(\hat{y}_i\) is the \(i\)th fitted value from the original least squares fit, and \(h_i\) is the leverage of observation \(i\)
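A quick sketch of this shortcut on the Auto data: hatvalues() returns the leverages \(h_i\), so LOOCV requires only the single least-squares fit

# LOOCV for mpg ~ horsepower via the leverage shortcut: one fit, no refitting
Auto <- ISLR2::Auto
fit <- lm(mpg ~ horsepower, data = Auto)
h <- hatvalues(fit)                              # leverages h_i
mean(((Auto$mpg - fitted(fit)) / (1 - h))^2)     # CV_(n)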
LOOCV is sometimes useful, but typically doesn’t shake up the data enough. The estimates from each fold are highly correlated and hence their average can have high variance
A better choice is \(K = 5\) or 10
library(caret)
# set seed for reproducibility
set.seed(1234)

# load the Auto data from ISLR2
Auto <- ISLR2::Auto

# specify 10-fold cross-validation as the resampling method
ctrl <- trainControl(method = "cv", number = 10)

# fit the linear regression of mpg on horsepower and evaluate it with 10-fold CV
model.fit <- train(mpg ~ horsepower, data = Auto, method = "lm", trControl = ctrl)

# summary of the k-fold CV results (averaged over the folds)
print(model.fit)

# the final model, refit on the full data set
model.fit$finalModel

# per-fold performance metrics (RMSE, R-squared, MAE)
model.fit$resample
Since each training set is only \((K-1)/K\) as big as the original training set, the estimates of prediction error will typically be biased upward
This bias is minimized when \(K = n\) (LOOCV), but this estimate has a high variance, as noted earlier
\(K=5\) or 10 provides a good compromise for this bias-variance trade-off
We divide the data into \(K\) roughly equal-sized parts \(C_1, C_2, \ldots, C_K\), where \(C_k\) denotes the indices of the observations in part \(k\)
There are \(n_k\) observations in part \(k\): if \(n\) is a multiple of \(K\), then \(n_k = n/K\)
Compute
\(CV_{(K)} = \sum_{k=1}^{K} \frac{n_k}{n} Err_k\), where \(Err_k = \sum_{i \in C_k} I(y_i \neq \hat{y}_i)/n_k\) is the error rate in part \(k\)
The estimated standard deviation of \(CV_K\) is
\(\widehat{SE}(CV_{K}) = \sqrt{\frac{1}{K} \sum_{k=1}^{K} \frac{(Err_k-\overline{Err_k})^2}{K-1}}\)
This is a useful estimate, but strictly speaking, not quite valid
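As a sketch of how this might be computed from the caret fit above (model.fit$resample stores the per-fold RMSE, so squaring it gives the per-fold MSE playing the role of \(Err_k\)):

# estimate CV_(K) and its standard error from the per-fold results above
errs <- model.fit$resample$RMSE^2   # Err_k for each of the K folds
K <- length(errs)
cv_K <- mean(errs)                  # folds are roughly equal-sized, so weights are about 1/K
se_cv <- sqrt(sum((errs - mean(errs))^2 / (K - 1)) / K)   # formula above; equals sd(errs)/sqrt(K)
c(CV = cv_K, SE = se_cv)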
Consider a simple classifier applied to some two-class data
Starting with 5000 predictors and 50 samples, find the 100 predictors having the largest correlation with the class labels
We then apply a classifier such as logistic regression, using only these 100 predictors
How do we estimate the test set performance of this classifier?
Can we apply cross-validation in step 2, forgetting about step 1?
NO!
This would ignore the fact that in Step 1, the procedure has already seen the labels of the training data, and made use of them. This is a form of training and must be included in the validation process
It is easy to simulate realistic data with the class labels independent of the predictors, so that the true test error is 50%, but the CV error estimate that ignores Step 1 is close to zero
Wrong Way
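A small simulation sketch of this point; the sizes follow the example above, but the 5-fold split and a 1-nearest-neighbour classifier (class package) stand in for logistic regression to keep the code short. Only step 2 is cross-validated, which is exactly the mistake:

# simulate labels independent of the predictors: the true test error is 50%
library(class)
set.seed(1234)
n <- 50; p <- 5000
x <- matrix(rnorm(n * p), nrow = n)
y <- factor(rep(c(0, 1), each = n / 2))

# Step 1 (done on ALL the data): keep the 100 predictors most correlated with y
cors <- apply(x, 2, function(col) abs(cor(col, as.numeric(y))))
keep <- order(cors, decreasing = TRUE)[1:100]
x_sel <- x[, keep]

# Step 2: 5-fold CV using only the selected predictors -- the wrong way
folds <- sample(rep(1:5, length.out = n))
errs <- sapply(1:5, function(k) {
  pred <- knn(train = x_sel[folds != k, ], test = x_sel[folds == k, ],
              cl = y[folds != k], k = 1)
  mean(pred != y[folds == k])
})
mean(errs)   # far below the true 50% error, because step 1 already saw all the labels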
The bootstrap is a flexible and powerful statistical tool that can be used to quantify the uncertainty associated with a given estimator or statistical learning method
For example, it can provide an estimate of the standard error of a coefficient, or a confidence interval for that coefficient
Why is it called the "bootstrap"?
The use of the term bootstrap derives from the phrase to pull oneself up by one's bootstraps, widely thought to be based on an episode in the eighteenth-century book "The Surprising Adventures of Baron Munchausen" by Rudolph Erich Raspe, in which the Baron pulls himself out of a deep lake by his own bootstraps
It is not the same as the term “bootstrap” used in computer science meaning to “boot” a computer from a set of core instructions, though the derivation is similar
Suppose that we wish to invest a fixed sum of money in two financial assets that yield returns of X and Y , respectively, where X and Y are random quantities
We will invest a fraction \(\alpha\) of our money in X, and will invest the remaining \(1-\alpha\) in Y
We wish to choose \(\alpha\) to minimize the total risk, or variance, of our investment. In other words, we want to minimize \(Var(\alpha X + (1-\alpha)Y)\)
One can show that the value that minimizes the risk is given by
\(\alpha = \frac{\sigma_{Y}^{2}-\sigma_{XY}}{\sigma_{X}^{2} + \sigma_{Y}^{2}-2\sigma_{XY}}\)
Where \(\sigma_{X}^{2} = Var(X)\), \(\sigma_{Y}^{2} = Var(Y)\), and \(\sigma_{XY} = Cov(X,Y)\)
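For completeness, a short derivation sketch: expand the variance and set its derivative with respect to \(\alpha\) to zero

\(Var(\alpha X + (1-\alpha)Y) = \alpha^2\sigma_X^2 + (1-\alpha)^2\sigma_Y^2 + 2\alpha(1-\alpha)\sigma_{XY}\)

\(2\alpha\sigma_X^2 - 2(1-\alpha)\sigma_Y^2 + 2(1-2\alpha)\sigma_{XY} = 0 \implies \alpha = \frac{\sigma_{Y}^{2}-\sigma_{XY}}{\sigma_{X}^{2} + \sigma_{Y}^{2}-2\sigma_{XY}}\)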
But the values of \(\sigma_{X}^{2}\), \(\sigma_{Y}^{2}\), and \(\sigma_{XY}\) are unknown
We can compute estimates for these quantities, \(\hat{\sigma_{X}^{2}}\), \(\hat{\sigma_{Y}^{2}}\), and \(\hat{\sigma_{XY}}\) using a data set that contains measurements for X and Y
We can then estimate the value of \(\alpha\) that minimizes the variance of our investment
Each panel displays 100 simulated returns for investments X and Y . From left to right and top to bottom, the resulting estimates for α are 0.576, 0.532, 0.657, and 0.651
To estimate the standard deviation of \(\hat{\alpha}\), we repeated the process of simulating 100 paired observations of X and Y, and estimating \(\alpha\), 1,000 times
We thereby obtained 1,000 estimates for \(\alpha\), which we can call \(\hat{\alpha}_1, \hat{\alpha}_2, \ldots, \hat{\alpha}_{1000}\)
The left-hand panel of the figure below displays a histogram of the resulting estimates
On the left side a histogram of the estimates of \(\alpha\) obtained by generating 1,000 simulated data sets from the true population.
In the center a histogram of the estimates of \(\alpha\) obtained from 1,000 bootstrap samples from a single data set.
On the right the estimates of \(\alpha\) displayed in the left and center panels are shown as boxplots. In each panel, the pink line indicates the true value of \(\alpha\)
The mean over all 1,000 estimates for \(\alpha\) is \(\bar{\alpha} = \frac{1}{1000} \sum_{r=1}^{1000} \hat{\alpha}_r = 0.5996\)
This is very close to \(\alpha = 0.6\), and the standard deviation of the estimates is approximately 0.083
This gives us a very good idea of the accuracy of \(\hat{\alpha}\): \(SE(\hat{\alpha}) \approx 0.083\)
So roughly speaking, for a random sample from the population, we would expect \(\hat{\alpha}\) to differ from \(\alpha\) by approximately 0.08, on average
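The bootstrap provides such a standard-error estimate from a single data set, without simulating from the true population. A sketch using boot() and the Portfolio data shipped with ISLR2 (its columns are named X and Y):

# bootstrap estimate of SE(alpha-hat) from a single data set
library(boot)
library(ISLR2)
set.seed(1234)

# statistic: the risk-minimizing alpha computed from the observations in `index`
alpha.fn <- function(data, index) {
  X <- data$X[index]
  Y <- data$Y[index]
  (var(Y) - cov(X, Y)) / (var(X) + var(Y) - 2 * cov(X, Y))
}

# 1,000 bootstrap replications; the reported std. error estimates SE(alpha-hat)
boot(Portfolio, alpha.fn, R = 1000)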
library(boot)
# set seed for reproducibility
set.seed(1234)

# load the Auto data from ISLR2
Auto <- ISLR2::Auto

# fit a linear regression of mpg on horsepower to the full data set
fit.lm1 <- lm(mpg ~ horsepower, data = Auto)
summary(fit.lm1)

# statistic function: refit the regression on the bootstrap sample given by `index`
# and return its coefficient estimates
boot.fn <- function(data, index) {
  fit.lm2 <- lm(mpg ~ horsepower, data = data[index, ])
  coef(fit.lm2)
}

# 1,000 bootstrap replications of the coefficient estimates
bootstrap.coef <- boot(data = Auto, statistic = boot.fn, R = 1000)
bootstrap.coef
The procedure outlined above cannot be applied, because for real data we cannot generate new samples from the original population
However, the bootstrap approach allows us to use a computer to mimic the process of obtaining new data sets, so that we can estimate the variability of our estimate without generating additional samples
Rather than repeatedly obtaining independent data sets from the population, we instead obtain distinct data sets by repeatedly sampling observations from the original data set with replacement
Each of these “bootstrap data sets” is created by sampling with replacement, and is the same size as our original dataset. As a result some observations may appear more than once in a given bootstrap data set and some not at all
An example with just 3 observations
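A toy sketch of what such bootstrap samples look like (the observation indices 1, 2, 3 stand in for the rows of the original data set):

# three bootstrap samples drawn from an original data set of n = 3 observations;
# each column resamples the row indices 1:3 with replacement
set.seed(1234)
replicate(3, sample(1:3, size = 3, replace = TRUE))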
In more complex data situations, figuring out the appropriate way to generate bootstrap samples can require some thought
For example, if the data is a time series, we can’t simply sample the observations with replacement
We can instead create blocks of consecutive observations and sample those with replacement; then we paste the sampled blocks together to obtain a bootstrap data set
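A rough sketch of a non-overlapping block bootstrap; the simulated AR(1) series and the block length of 10 are illustrative choices:

# block bootstrap for a time series: resample whole blocks, then paste them together
set.seed(1234)
y <- as.numeric(arima.sim(model = list(ar = 0.7), n = 100))    # toy autocorrelated series
block_len <- 10
starts <- seq(1, length(y) - block_len + 1, by = block_len)    # block starting points

# sample blocks with replacement and concatenate them into one bootstrap series
boot_series <- unlist(lapply(sample(starts, length(starts), replace = TRUE),
                             function(s) y[s:(s + block_len - 1)]))
length(boot_series)   # same length as the original series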
Primarily used to obtain standard errors of an estimate
Also provides approximate confidence intervals for a population parameter; for example, the appropriate quantiles of the 1,000 bootstrap estimates of \(\alpha\) (e.g., the 2.5% and 97.5% quantiles for a 95% interval) form an approximate confidence interval for \(\alpha\)
An interval obtained this way is called a Bootstrap Percentile confidence interval; it is the simplest method (among many approaches) for obtaining a confidence interval from the bootstrap
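For instance, applied to the bootstrap.coef object created earlier, a sketch using boot.ci() from the boot package (index = 2 selects the horsepower coefficient):

# 95% bootstrap percentile confidence interval for the horsepower coefficient
boot.ci(bootstrap.coef, type = "perc", index = 2)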
In cross-validation, each of the K validation folds is distinct from the other K − 1 folds used for training: there is no overlap. This is crucial for its success
To estimate prediction error using the bootstrap, we could think about using each bootstrap dataset as our training sample, and the original sample as our validation sample
But each bootstrap sample has significant overlap with the original data: the probability that a given observation appears in a bootstrap sample is \(1-(1-1/n)^n \approx 1-1/e\), so about two-thirds of the original data points appear in each bootstrap sample (see the quick check after this list)
This will cause the bootstrap to seriously underestimate the true prediction error
The other way around (original sample as training sample, bootstrap data set as validation sample) is worse!
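A quick numerical check of the "about two-thirds" claim referenced above (n = 392 matches the Auto data used earlier):

# probability that a given observation appears in a bootstrap sample of size n
n <- 392
1 - (1 - 1/n)^n   # analytic value, roughly 1 - 1/e = 0.632
# empirical check: average fraction of distinct observations per bootstrap sample
mean(replicate(1000, length(unique(sample(n, replace = TRUE))) / n))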
Removing the overlap
Can partly fix this problem by only using predictions for those observations that did not (by chance) occur in the current bootstrap sample
But the method gets complicated, and in the end, cross-validation provides a simpler, more attractive approach for estimating prediction error
In microarray and other genomic studies, an important problem is to compare a predictor of disease outcome derived from a large number of “biomarkers” to standard clinical predictors
Comparing them on the same dataset that was used to derive the biomarker predictor can lead to results strongly biased in favor of the biomarker predictor
Pre-validation can be used to make a fairer comparison between the two sets of predictors
Idea behind pre-validation
Designed for comparison of adaptively derived predictors to fixed, pre-defined predictors
The idea is to form a “pre-validated” version of the adaptive predictor: specifically, a “fairer” version that hasn’t “seen” the response y
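A toy sketch of the idea on simulated data; all sizes, the 10-biomarker screen, and the logistic models are illustrative choices, not a prescribed recipe:

# pre-validation sketch: the biomarker predictor for each fold is derived
# without using that fold, so it has not "seen" those responses
set.seed(1234)
n <- 100; p <- 200
x <- matrix(rnorm(n * p), n)                  # simulated "biomarkers"
clinical <- rnorm(n)                          # a fixed, pre-defined clinical predictor
y <- rbinom(n, 1, plogis(0.5 * clinical))     # outcome depends only on the clinical predictor

K <- 5
folds <- sample(rep(1:K, length.out = n))
preval <- numeric(n)
for (k in 1:K) {
  tr <- folds != k
  # adaptively derive the biomarker predictor WITHOUT fold k:
  # screen the 10 biomarkers most correlated with y, then fit a logistic model
  cors <- apply(x[tr, ], 2, function(col) abs(cor(col, y[tr])))
  keep <- order(cors, decreasing = TRUE)[1:10]
  fit <- glm(y[tr] ~ x[tr, keep], family = binomial)
  # pre-validated predictor for the held-out fold
  preval[!tr] <- cbind(1, x[!tr, keep]) %*% coef(fit)
}
# fairer comparison: pre-validated biomarker predictor vs the clinical predictor
summary(glm(y ~ preval + clinical, family = binomial))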
The bootstrap samples from the estimated population, and uses the results to estimate standard errors and confidence intervals
Permutation methods sample from an estimated null distribution for the data, and use this to estimate p-values and False Discovery Rates for hypothesis tests
The bootstrap can be used to test a null hypothesis in simple situations. E.g., if \(\theta = 0\) is the null hypothesis, we check whether the confidence interval for \(\theta\) contains zero
Can also adapt the bootstrap to sample from a null distribution (See Efron and Tibshirani book “An Introduction to the Bootstrap” (1993), chapter 16) but there’s no real advantage over permutations
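A toy sketch contrasting the two ideas on a difference in group means (all numbers and names are illustrative):

# toy data: two groups whose means differ slightly
set.seed(1234)
x <- rnorm(30, mean = 0.3)
y <- rnorm(30, mean = 0)
obs <- mean(x) - mean(y)   # observed difference in means

# permutation: sample from the null by shuffling group membership, get a p-value
pooled <- c(x, y)
perm <- replicate(2000, {
  idx <- sample(length(pooled), length(x))
  mean(pooled[idx]) - mean(pooled[-idx])
})
mean(abs(perm) >= abs(obs))   # two-sided permutation p-value

# bootstrap: resample each group with replacement and check whether the
# percentile interval for the difference contains zero
bootd <- replicate(2000, mean(sample(x, replace = TRUE)) - mean(sample(y, replace = TRUE)))
quantile(bootd, c(0.025, 0.975))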